We analyse the random forest model trained in previous assignments on the FIFA 23 game dataset. The model aims to predict players' wages given their statistics in the game.
We use Local Interpretable Model-agnostic Explanations (LIME) to analyse feature importance.
We analyse stability of the method.
We compare the LIME method with the SHAP method implemented during the previous assignment.
We also compare LIME explanations for the random forest and the linear model.
We compare LIME explanations for three observations and for each use five different seeds.
We observe that the method is quite stable: the most important features are usually the same across seeds (note that the top-2 features are identical for all observations), though there are some differences among the less important features.
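This stability claim can be checked quantitatively by extracting the top-k features from each seed's explanation and measuring the overlap of the resulting sets. A minimal sketch, using hypothetical (feature, weight) pairs in the format returned by lime's `Explanation.as_list()` (the names and numbers below are illustrative, not taken from the model):

```python
# Hypothetical (feature, weight) pairs in the format returned by
# lime's Explanation.as_list() for two different random seeds.
exp_seed0 = [("Overall > 80", 21000.0), ("Value_in_euro", 15000.0),
             ("Age <= 25", -3000.0), ("Potential", 2500.0)]
exp_seed1 = [("Overall > 80", 20500.0), ("Value_in_euro", 14800.0),
             ("Potential", 2700.0), ("Reactions", -2100.0)]

def top_k_features(explanation, k=3):
    """Return the k features with the largest absolute weights."""
    ranked = sorted(explanation, key=lambda fw: abs(fw[1]), reverse=True)
    return {feature for feature, _ in ranked[:k]}

# Shared top-3 features across the two seeds.
overlap = top_k_features(exp_seed0) & top_k_features(exp_seed1)
print(sorted(overlap))
```

The same comparison, applied to all pairs of seeds, gives a simple numeric summary of how stable the explanations are.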
The first seed:
The second seed:
The third seed:
The fourth seed:
The fifth seed:
The first seed:
The second seed:
The third seed:
The fourth seed:
The fifth seed:
The first seed:
The second seed:
The third seed:
The fourth seed:
The fifth seed:
We observe that LIME and SHAP explanations may differ even for the most important features (see the comparison for the second chosen observation). The top-3 features are consistent for only one of the five observations (the fourth). However, comparing the sets of the top-3 most important features, at least two elements overlap for all five observations. This suggests that both methods are useful for separating important from unimportant features.
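The overlap of the top-3 sets can be computed directly from the attributions. A minimal sketch with hypothetical SHAP values and LIME weights for one observation (the feature names and numbers are illustrative, not taken from the model):

```python
# Hypothetical per-feature attributions for a single observation.
feature_names = ["Overall", "Value_in_euro", "Potential", "Age", "Reactions"]
shap_vals = [30000.0, -22000.0, 4000.0, -1500.0, 800.0]
lime_weights = [28000.0, -18000.0, 1200.0, -5000.0, 600.0]

def top_k(names, values, k=3):
    """Features with the k largest absolute attributions."""
    ranked = sorted(zip(names, values), key=lambda nv: abs(nv[1]), reverse=True)
    return {name for name, _ in ranked[:k]}

# Features that appear in the top-3 of both methods.
shared = top_k(feature_names, shap_vals) & top_k(feature_names, lime_weights)
print(sorted(shared))
```

In the real comparison, `shap_vals` would come from `TreeExplainer.shap_values` and `lime_weights` from the LIME explanation, mapped back to raw feature names since LIME reports binned conditions such as "Overall > 80".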
LIME explanations:
SHAP explanations:
LIME explanations:
SHAP explanations:
LIME explanations:
SHAP explanations:
LIME explanations:
SHAP explanations:
LIME explanations:
SHAP explanations:
The most important features for the random forest model and the linear model usually differ. The tree model mostly relies on the Overall and Value in euro features, while the linear model often relies on the Stats and Position ratings.
We observe that the most important features are usually consistent between the lime and dalex libraries; there are some differences in the less important features.
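One simple way to quantify this agreement between libraries is to compare the top-k prefixes of the two feature rankings. A minimal sketch with hypothetical rankings (the feature order shown is illustrative, not taken from the model):

```python
# Hypothetical feature rankings (most important first) produced by the
# lime and dalex explanations of the same observation.
ranking_lime = ["Overall", "Value_in_euro", "Potential", "Age", "Reactions"]
ranking_dalex = ["Overall", "Value_in_euro", "Age", "Potential", "Reactions"]

def agreement_at_k(a, b, k):
    """Fraction of shared features among the top-k of two rankings."""
    return len(set(a[:k]) & set(b[:k])) / k

print(agreement_at_k(ranking_lime, ranking_dalex, 2))  # top-2 identical -> 1.0
print(agreement_at_k(ranking_lime, ranking_dalex, 3))  # top-3 partly overlap
```

Agreement near 1.0 for small k and lower values for larger k matches the observation above: the libraries agree on the most important features and diverge on the tail.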
The random forest explanation with dalex library:
The random forest explanation with lime library:
The linear model explanation with dalex library:
The linear model explanation with lime library:
The random forest explanation with dalex library:
The random forest explanation with lime library:
The linear model explanation with dalex library:
The linear model explanation with lime library:
The random forest explanation with dalex library:
The random forest explanation with lime library:
The linear model explanation with dalex library:
The linear model explanation with lime library:
The random forest explanation with dalex library:
The random forest explanation with lime library:
The linear model explanation with dalex library:
The linear model explanation with lime library:
The random forest explanation with dalex library:
The random forest explanation with lime library:
The linear model explanation with dalex library:
The linear model explanation with lime library:
# 1. Import libraries
!pip3 install shap
!pip3 install dalex
!pip3 install lime
!pip3 install -q condacolab
import condacolab
condacolab.install()
!conda install -c conda-forge python-kaleido
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly
import kaleido
import pickle
import lime
import shap
import dalex as dx
from math import isclose
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# 2. Load dataset and models from the previous homework
with open('X_train.pickle', 'rb') as handle:
    X_train_load = pickle.load(handle)
with open('y_train.pickle', 'rb') as handle:
    y_train_load = pickle.load(handle)
with open('X_test.pickle', 'rb') as handle:
    X_test_load = pickle.load(handle)
with open('y_test.pickle', 'rb') as handle:
    y_test_load = pickle.load(handle)
with open('tree_model.pickle', 'rb') as handle:
    forest_reg_load = pickle.load(handle)
with open('linear_model.pickle', 'rb') as handle:
    linear_model_load = pickle.load(handle)
print(X_train_load)
print(y_train_load)
print(X_test_load)
print(y_test_load)
print(forest_reg_load.predict(X_train_load))
print(linear_model_load.predict(X_train_load))
print(forest_reg_load.predict(X_test_load))
print(linear_model_load.predict(X_test_load))
# 1. Observe predictions of 5 observations (POINT 1)
observations = X_test_load.sample(5, random_state = 1)
predictions_forest = forest_reg_load.predict(observations)
predictions_linear = linear_model_load.predict(observations)
print(observations)
print(predictions_forest)
print(predictions_linear)
# 2. Calculate lime decomposition for selected observations with lime library for the forest model (POINT 2, 5)
lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train_load.values,
    feature_names=X_train_load.columns,
    mode="regression"
)
plot_id = 1
for i in range(len(observations)):
    lime_explanation_forest = lime_explainer.explain_instance(
        data_row=observations.iloc[i],
        predict_fn=forest_reg_load.predict
    )
    _ = lime_explanation_forest.as_pyplot_figure()
    plt.savefig('Point2_lime_forest_im' + str(plot_id) + '.png', dpi=300, bbox_inches='tight')
    plot_id += 1
# 3. Calculate lime decomposition for selected observations with lime library for the linear model (POINT 2, 5)
plot_id = 1
for i in range(len(observations)):
    lime_explanation_linear = lime_explainer.explain_instance(
        data_row=observations.iloc[i],
        predict_fn=linear_model_load.predict
    )
    _ = lime_explanation_linear.as_pyplot_figure()
    plt.savefig('Point2_lime_linear_im' + str(plot_id) + '.png', dpi=300, bbox_inches='tight')
    plot_id += 1
# 4. Calculate lime decomposition for selected observations with dalex library for the forest model (POINT 2, 5)
explainer_forest_dx = dx.Explainer(forest_reg_load, X_train_load, y_train_load, verbose=False)
plot_id = 1
for i in range(len(observations)):
    explanation_forest_dx = explainer_forest_dx.predict_surrogate(observations.iloc[i])
    explanation_forest_dx.plot()
    plt.savefig('Point2_dalex_forest_im' + str(plot_id) + '.png', dpi=300, bbox_inches='tight')
    plot_id += 1
# 5. Calculate lime decomposition for selected observations with dalex library for the linear model (POINT 2, 5)
explainer_linear_dx = dx.Explainer(linear_model_load, X_train_load, y_train_load, verbose=False)
plot_id = 1
for i in range(len(observations)):
    explanation_linear_dx = explainer_linear_dx.predict_surrogate(observations.iloc[i])
    explanation_linear_dx.plot()
    plt.savefig('Point2_dalex_linear_im' + str(plot_id) + '.png', dpi=300, bbox_inches='tight')
    plot_id += 1
# 6. Calculate Shapley values for selected observations with shap library (POINT 4)
explainer_shap = shap.TreeExplainer(forest_reg_load)
shap_values = explainer_shap.shap_values(observations)
plot_id = 1
for i in range(len(observations)):
    shap.bar_plot(shap_values[i], feature_names=observations.columns, show=False)
    plt.savefig('Point4_shap_im' + str(plot_id) + '.png', dpi=300, bbox_inches='tight')
    plot_id += 1
# 7. Analyze stability of the method
plot_id = 1
for i in range(3):
    for seed in range(5):
        # Reseed the global NumPy RNG so each run draws a different perturbation sample.
        np.random.seed(seed)
        explanation_forest_dx = explainer_forest_dx.predict_surrogate(observations.iloc[i])
        explanation_forest_dx.plot()
        plt.savefig('Point3_seed' + str(seed) + '_im' + str(plot_id) + '.png', dpi=300, bbox_inches='tight')
        plot_id += 1